R# Datasets included in base R
library(help = "datasets")
# All datasets included in the packages you have installed
data(package = .packages(all.available = TRUE))
First, load the tidyverse package
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
Missing data can be problematic in analysis, and it is crucial to
identify and handle them appropriately. In R, missing data is typically
represented as NA. However, datasets may encode missing
values in other ways, such as "N/A", "-",
".", or 999. We can, and should, convert these
into proper NAvalues.
Checking for Missing Data:
The is.na() function is useful for identifying missing
values in a dataset:
# Example dataset
data <- data.frame(
id = 1:5,
value = c(10, "N/A", 15, "-", 20)
)
# Convert character encodings of missing values to NA
data$value[data$value %in% c("N/A", "-")] <- NA
# Convert column to numeric type
data$value <- as.numeric(data$value)
# Check missing data
is.na(data)
## id value
## [1,] FALSE FALSE
## [2,] FALSE TRUE
## [3,] FALSE FALSE
## [4,] FALSE TRUE
## [5,] FALSE FALSE
There are multiple strategies to handle missing data:
Omitting missing data: Use
na.omit() to remove rows with NA values. This is often NOT
a good idea. It requires knowing your data, where your missing data
values are coming from, why they are missing, and what your analytical
goals are.
Mean substitution: Replace missing values with the mean of the non-missing values.
Data imputation: The process of replacing missing values in a dataset with estimated values based on other available data. This helps maintain dataset completeness and reduces bias introduced by missing observations. The process can range from simple methods like replacing missing values with the mean, median, or mode of a variable to more complex approaches such as regression imputation, k-nearest neighbors (KNN), or multiple imputation, which use predictive modeling techniques to estimate missing values. The choice of method depends on the nature of the data and the assumptions that can be made about the missing values.
Quick Note: In R, indexing with brackets
[ ]allows you to access specific elements within vectors, matrices, and data frames. For vectors, a single bracket[ ]is used to extract elements by position, logical condition, or name. For example,x[2]retrieves the second element of a vectorx, whilex[x > 5]returns elements greater than 5. For matrices and data frames, two brackets[ , ]are used to specify rows and columns.df[1, 2]retrieves the value in the first row and second column ofdf, whiledf[ , "column_name"]extracts an entire column by name. Usingdf[ , 2]returns the second column, anddf[1, ]retrieves the entire first row.
Example:
# Mean substitution
data$value[is.na(data$value)] <- mean(data$value, na.rm = TRUE)
# Omit missing data
data_complete <- na.omit(data)
The stringr package helps in handling and manipulating
character data. Let’s use the starwars data set from
dplyr.
Example:
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
# Remove white spaces (there are none)
starwars$name <- trimws(starwars$name)
starwars$name
## [1] "Luke Skywalker" "C-3PO" "R2-D2"
## [4] "Darth Vader" "Leia Organa" "Owen Lars"
## [7] "Beru Whitesun Lars" "R5-D4" "Biggs Darklighter"
## [10] "Obi-Wan Kenobi" "Anakin Skywalker" "Wilhuff Tarkin"
## [13] "Chewbacca" "Han Solo" "Greedo"
## [16] "Jabba Desilijic Tiure" "Wedge Antilles" "Jek Tono Porkins"
## [19] "Yoda" "Palpatine" "Boba Fett"
## [22] "IG-88" "Bossk" "Lando Calrissian"
## [25] "Lobot" "Ackbar" "Mon Mothma"
## [28] "Arvel Crynyd" "Wicket Systri Warrick" "Nien Nunb"
## [31] "Qui-Gon Jinn" "Nute Gunray" "Finis Valorum"
## [34] "Padmé Amidala" "Jar Jar Binks" "Roos Tarpals"
## [37] "Rugor Nass" "Ric Olié" "Watto"
## [40] "Sebulba" "Quarsh Panaka" "Shmi Skywalker"
## [43] "Darth Maul" "Bib Fortuna" "Ayla Secura"
## [46] "Ratts Tyerel" "Dud Bolt" "Gasgano"
## [49] "Ben Quadinaros" "Mace Windu" "Ki-Adi-Mundi"
## [52] "Kit Fisto" "Eeth Koth" "Adi Gallia"
## [55] "Saesee Tiin" "Yarael Poof" "Plo Koon"
## [58] "Mas Amedda" "Gregar Typho" "Cordé"
## [61] "Cliegg Lars" "Poggle the Lesser" "Luminara Unduli"
## [64] "Barriss Offee" "Dormé" "Dooku"
## [67] "Bail Prestor Organa" "Jango Fett" "Zam Wesell"
## [70] "Dexter Jettster" "Lama Su" "Taun We"
## [73] "Jocasta Nu" "R4-P17" "Wat Tambor"
## [76] "San Hill" "Shaak Ti" "Grievous"
## [79] "Tarfful" "Raymus Antilles" "Sly Moore"
## [82] "Tion Medon" "Finn" "Rey"
## [85] "Poe Dameron" "BB8" "Captain Phasma"
# Remove only the first the blank space
starwars$name <- str_remove(starwars$name, " ")
starwars$name
## [1] "LukeSkywalker" "C-3PO" "R2-D2"
## [4] "DarthVader" "LeiaOrgana" "OwenLars"
## [7] "BeruWhitesun Lars" "R5-D4" "BiggsDarklighter"
## [10] "Obi-WanKenobi" "AnakinSkywalker" "WilhuffTarkin"
## [13] "Chewbacca" "HanSolo" "Greedo"
## [16] "JabbaDesilijic Tiure" "WedgeAntilles" "JekTono Porkins"
## [19] "Yoda" "Palpatine" "BobaFett"
## [22] "IG-88" "Bossk" "LandoCalrissian"
## [25] "Lobot" "Ackbar" "MonMothma"
## [28] "ArvelCrynyd" "WicketSystri Warrick" "NienNunb"
## [31] "Qui-GonJinn" "NuteGunray" "FinisValorum"
## [34] "PadméAmidala" "JarJar Binks" "RoosTarpals"
## [37] "RugorNass" "RicOlié" "Watto"
## [40] "Sebulba" "QuarshPanaka" "ShmiSkywalker"
## [43] "DarthMaul" "BibFortuna" "AylaSecura"
## [46] "RattsTyerel" "DudBolt" "Gasgano"
## [49] "BenQuadinaros" "MaceWindu" "Ki-Adi-Mundi"
## [52] "KitFisto" "EethKoth" "AdiGallia"
## [55] "SaeseeTiin" "YaraelPoof" "PloKoon"
## [58] "MasAmedda" "GregarTypho" "Cordé"
## [61] "ClieggLars" "Pogglethe Lesser" "LuminaraUnduli"
## [64] "BarrissOffee" "Dormé" "Dooku"
## [67] "BailPrestor Organa" "JangoFett" "ZamWesell"
## [70] "DexterJettster" "LamaSu" "TaunWe"
## [73] "JocastaNu" "R4-P17" "WatTambor"
## [76] "SanHill" "ShaakTi" "Grievous"
## [79] "Tarfful" "RaymusAntilles" "SlyMoore"
## [82] "TionMedon" "Finn" "Rey"
## [85] "PoeDameron" "BB8" "CaptainPhasma"
# Remove ALL the blank spaces
starwars$name <- str_remove_all(starwars$name, " ")
starwars$name
## [1] "LukeSkywalker" "C-3PO" "R2-D2"
## [4] "DarthVader" "LeiaOrgana" "OwenLars"
## [7] "BeruWhitesunLars" "R5-D4" "BiggsDarklighter"
## [10] "Obi-WanKenobi" "AnakinSkywalker" "WilhuffTarkin"
## [13] "Chewbacca" "HanSolo" "Greedo"
## [16] "JabbaDesilijicTiure" "WedgeAntilles" "JekTonoPorkins"
## [19] "Yoda" "Palpatine" "BobaFett"
## [22] "IG-88" "Bossk" "LandoCalrissian"
## [25] "Lobot" "Ackbar" "MonMothma"
## [28] "ArvelCrynyd" "WicketSystriWarrick" "NienNunb"
## [31] "Qui-GonJinn" "NuteGunray" "FinisValorum"
## [34] "PadméAmidala" "JarJarBinks" "RoosTarpals"
## [37] "RugorNass" "RicOlié" "Watto"
## [40] "Sebulba" "QuarshPanaka" "ShmiSkywalker"
## [43] "DarthMaul" "BibFortuna" "AylaSecura"
## [46] "RattsTyerel" "DudBolt" "Gasgano"
## [49] "BenQuadinaros" "MaceWindu" "Ki-Adi-Mundi"
## [52] "KitFisto" "EethKoth" "AdiGallia"
## [55] "SaeseeTiin" "YaraelPoof" "PloKoon"
## [58] "MasAmedda" "GregarTypho" "Cordé"
## [61] "ClieggLars" "PoggletheLesser" "LuminaraUnduli"
## [64] "BarrissOffee" "Dormé" "Dooku"
## [67] "BailPrestorOrgana" "JangoFett" "ZamWesell"
## [70] "DexterJettster" "LamaSu" "TaunWe"
## [73] "JocastaNu" "R4-P17" "WatTambor"
## [76] "SanHill" "ShaakTi" "Grievous"
## [79] "Tarfful" "RaymusAntilles" "SlyMoore"
## [82] "TionMedon" "Finn" "Rey"
## [85] "PoeDameron" "BB8" "CaptainPhasma"
# Remove ALL the dashes
starwars$name <- str_remove_all(starwars$name, "-")
starwars$name
## [1] "LukeSkywalker" "C3PO" "R2D2"
## [4] "DarthVader" "LeiaOrgana" "OwenLars"
## [7] "BeruWhitesunLars" "R5D4" "BiggsDarklighter"
## [10] "ObiWanKenobi" "AnakinSkywalker" "WilhuffTarkin"
## [13] "Chewbacca" "HanSolo" "Greedo"
## [16] "JabbaDesilijicTiure" "WedgeAntilles" "JekTonoPorkins"
## [19] "Yoda" "Palpatine" "BobaFett"
## [22] "IG88" "Bossk" "LandoCalrissian"
## [25] "Lobot" "Ackbar" "MonMothma"
## [28] "ArvelCrynyd" "WicketSystriWarrick" "NienNunb"
## [31] "QuiGonJinn" "NuteGunray" "FinisValorum"
## [34] "PadméAmidala" "JarJarBinks" "RoosTarpals"
## [37] "RugorNass" "RicOlié" "Watto"
## [40] "Sebulba" "QuarshPanaka" "ShmiSkywalker"
## [43] "DarthMaul" "BibFortuna" "AylaSecura"
## [46] "RattsTyerel" "DudBolt" "Gasgano"
## [49] "BenQuadinaros" "MaceWindu" "KiAdiMundi"
## [52] "KitFisto" "EethKoth" "AdiGallia"
## [55] "SaeseeTiin" "YaraelPoof" "PloKoon"
## [58] "MasAmedda" "GregarTypho" "Cordé"
## [61] "ClieggLars" "PoggletheLesser" "LuminaraUnduli"
## [64] "BarrissOffee" "Dormé" "Dooku"
## [67] "BailPrestorOrgana" "JangoFett" "ZamWesell"
## [70] "DexterJettster" "LamaSu" "TaunWe"
## [73] "JocastaNu" "R4P17" "WatTambor"
## [76] "SanHill" "ShaakTi" "Grievous"
## [79] "Tarfful" "RaymusAntilles" "SlyMoore"
## [82] "TionMedon" "Finn" "Rey"
## [85] "PoeDameron" "BB8" "CaptainPhasma"
# Make lowercase
starwars$name <- str_to_lower(starwars$name)
starwars$name
## [1] "lukeskywalker" "c3po" "r2d2"
## [4] "darthvader" "leiaorgana" "owenlars"
## [7] "beruwhitesunlars" "r5d4" "biggsdarklighter"
## [10] "obiwankenobi" "anakinskywalker" "wilhufftarkin"
## [13] "chewbacca" "hansolo" "greedo"
## [16] "jabbadesilijictiure" "wedgeantilles" "jektonoporkins"
## [19] "yoda" "palpatine" "bobafett"
## [22] "ig88" "bossk" "landocalrissian"
## [25] "lobot" "ackbar" "monmothma"
## [28] "arvelcrynyd" "wicketsystriwarrick" "niennunb"
## [31] "quigonjinn" "nutegunray" "finisvalorum"
## [34] "padméamidala" "jarjarbinks" "roostarpals"
## [37] "rugornass" "ricolié" "watto"
## [40] "sebulba" "quarshpanaka" "shmiskywalker"
## [43] "darthmaul" "bibfortuna" "aylasecura"
## [46] "rattstyerel" "dudbolt" "gasgano"
## [49] "benquadinaros" "macewindu" "kiadimundi"
## [52] "kitfisto" "eethkoth" "adigallia"
## [55] "saeseetiin" "yaraelpoof" "plokoon"
## [58] "masamedda" "gregartypho" "cordé"
## [61] "cliegglars" "pogglethelesser" "luminaraunduli"
## [64] "barrissoffee" "dormé" "dooku"
## [67] "bailprestororgana" "jangofett" "zamwesell"
## [70] "dexterjettster" "lamasu" "taunwe"
## [73] "jocastanu" "r4p17" "wattambor"
## [76] "sanhill" "shaakti" "grievous"
## [79] "tarfful" "raymusantilles" "slymoore"
## [82] "tionmedon" "finn" "rey"
## [85] "poedameron" "bb8" "captainphasma"
Recoding variables helps in transforming data for better analysis. We
can use mutate() and case_when() from the
dplyr package.
Example:
# Look at our data
table(starwars$species)
##
## Aleena Besalisk Cerean Chagrian Clawdite
## 1 1 1 1 1
## Droid Dug Ewok Geonosian Gungan
## 6 1 1 1 3
## Human Hutt Iktotchi Kaleesh Kaminoan
## 35 1 1 1 2
## Kel Dor Mirialan Mon Calamari Muun Nautolan
## 1 2 1 1 1
## Neimodian Pau'an Quermian Rodian Skakoan
## 1 1 1 1 1
## Sullustan Tholothian Togruta Toong Toydarian
## 1 1 1 1 1
## Trandoshan Twi'lek Vulptereen Wookiee Xexto
## 1 2 1 2 1
## Yoda's species Zabrak
## 1 2
# Recode a qualitative variable's value labels:
starwars <- starwars %>%
mutate(species = case_when(
species != "Human" ~ "NonHuman",
species == "Human" ~ "Human"
))
# Look at the recoded data
table(starwars$species)
##
## Human NonHuman
## 35 48
# Recode a qualitative variable `mass_group` from a quantitative variable `mass`:
## Histogram of mass of charachters
hist(starwars$mass, breaks = 20, xlab = "Mass", main = "Historgram of mass of StarWars charachters")
abline(v = mean(starwars$mass, na.rm = TRUE), col = "red", lwd = 2)
## Recode mass
starwars <- starwars %>%
mutate(mass_group = case_when(
mass < mean(mass, na.rm = TRUE) ~ "BelowMean",
mass >= mean(mass, na.rm = TRUE) ~ "AboveMean"
))
## Look at recoded mass data
table(starwars$mass_group)
##
## AboveMean BelowMean
## 10 49
## If we wanted to rename all the variables:
df <- tibble("First Name" = c("Alice", "Bob"), "Last Name" = c("Smith", "Jones"))
# Remove spaces and convert to camelCase
colnames(df) <- str_replace_all(colnames(df), " ", "_")
colnames(df) <- str_to_lower(colnames(df))
print(df)
## # A tibble: 2 × 2
## first_name last_name
## <chr> <chr>
## 1 Alice Smith
## 2 Bob Jones
Selecting specific variables from a dataset helps streamline analysis
and is useful for keeping your distinct types of data in their own
dataframes. The select() function from dplyr allows us to
extract specific columns from a dataset.
Example:
starwars_data <- starwars
# Select specific columns
starwars_data_anthropometrics <- starwars_data %>%
select(name, height, mass) # Select only these specific columns
print(head(starwars_data_anthropometrics))
## # A tibble: 6 × 3
## name height mass
## <chr> <int> <dbl>
## 1 lukeskywalker 172 77
## 2 c3po 167 75
## 3 r2d2 96 32
## 4 darthvader 202 136
## 5 leiaorgana 150 49
## 6 owenlars 178 120
# Select columns based on a pattern
starwars_data_demographics <- starwars_data %>%
select(name, contains("_")) # Selects name and any column that contains an underscore
print(head(starwars_data_demographics))
## # A tibble: 6 × 6
## name hair_color skin_color eye_color birth_year mass_group
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 lukeskywalker blond fair blue 19 BelowMean
## 2 c3po <NA> gold yellow 112 BelowMean
## 3 r2d2 <NA> white, blue red 33 BelowMean
## 4 darthvader none white yellow 41.9 AboveMean
## 5 leiaorgana brown light brown 19 BelowMean
## 6 owenlars brown, grey light blue 52 AboveMean
Merging datasets is a key skill when dealing with real-world data. We
will focus on left_join() and inner_join()
from dplyr.
A left_join() keeps all rows from the left dataset and
adds matching data from the right dataset. If no match is found,
NA values are introduced.
# Example datasets
data1 <- data.frame(id = 1:5, name = c("Alice", "Bob", "Charlie", "David", "Eva"),
score = c(85, 90, 88, 92, 95))
data2 <- data.frame(id = c(3, 4, 5, 6),
grade = c("B+", "A", "A-", "B"))
head(data1)
## id name score
## 1 1 Alice 85
## 2 2 Bob 90
## 3 3 Charlie 88
## 4 4 David 92
## 5 5 Eva 95
head(data2)
## id grade
## 1 3 B+
## 2 4 A
## 3 5 A-
## 4 6 B
# Left Join
data_merged <- left_join(data1, data2, by = "id")
print(data_merged)
## id name score grade
## 1 1 Alice 85 <NA>
## 2 2 Bob 90 <NA>
## 3 3 Charlie 88 B+
## 4 4 David 92 A
## 5 5 Eva 95 A-
An inner_join() keeps only the rows that have matching
keys in both datasets.
# Inner Join
data_inner <- inner_join(data1, data2, by = "id")
print(data_inner)
## id name score grade
## 1 3 Charlie 88 B+
## 2 4 David 92 A
## 3 5 Eva 95 A-
Select all countries.
Select at least two specific variables (under Series), e.g., GPD (Current US$), People using safely mangaged drinking water services (% of population)
Click the
ifor operational definition: GDP at purchaser’s prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars. Dollar figures for GDP are converted from domestic currencies using single year official exchange rates. For a few countries where the official exchange rate does not reflect the rate effectively applied to actual foreign exchange transactions, an alternative conversion factor is used.
Select one year (e.g., 2023) (or two if you want a challenge!).
You might need to click on Apply changes.
Question: Do you need to do any editing to the CSV file before it is ready to read in to R?
Download additional country-level metadata (available (here)[https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv] or here)
url <- "https://raw.githubusercontent.com/zhgarfield/raw2refined/main/data/country_metadata.csv"
country_metadata <- read.csv(url, stringsAsFactors = FALSE)
# Check the first few rows
head(country_metadata)
## name alpha.2 alpha.3 country.code iso_3166.2 region
## 1 Afghanistan AF AFG 4 ISO 3166-2:AF Asia
## 2 Ã…land Islands AX ALA 248 ISO 3166-2:AX Europe
## 3 Albania AL ALB 8 ISO 3166-2:AL Europe
## 4 Algeria DZ DZA 12 ISO 3166-2:DZ Africa
## 5 American Samoa AS ASM 16 ISO 3166-2:AS Oceania
## 6 Andorra AD AND 20 ISO 3166-2:AD Europe
## sub.region intermediate.region region.code sub.region.code
## 1 Southern Asia 142 34
## 2 Northern Europe 150 154
## 3 Southern Europe 150 39
## 4 Northern Africa 2 15
## 5 Polynesia 9 61
## 6 Southern Europe 150 39
## intermediate.region.code
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
How do these data violate tidy data principles?
What do we need to do to make them tidy? * Do we need to reshape? * Is there missing data to omit? * Do you want to adjust variable names? * Any name changes to make match? * Join?
Tidy your data!